length of the reads equal. However, reads of unequal length are sometimes generated,
especially if the reads are trimmed to remove low-quality bases at the beginning or end
of the reads. The sequence length distribution graph shows the distribution of read lengths.
If all reads are of the same length, the graph is simple, with a single peak at the bar
for that one length (Figure 1.22a). When reads are of variable length, the graph shows
the relative read count for each read length (Figure 1.22b). A warning is displayed if
the reads do not all have the same length.
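The length check described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the QC tool's own implementation: the function names `read_length_distribution` and `length_status` are hypothetical, the toy reads are made up, and the parser assumes a simple FASTQ layout with exactly four lines per record.

```python
from collections import Counter
from io import StringIO


def read_length_distribution(handle):
    """Count how many reads have each length in a 4-line-per-record FASTQ stream."""
    lengths = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the bases
            lengths[len(line.strip())] += 1
    return lengths


def length_status(lengths):
    """Return 'warn' if the reads differ in length, otherwise 'pass'."""
    return "warn" if len(lengths) > 1 else "pass"


# Toy data (assumed for illustration): three reads, one trimmed to 5 bases.
fastq = StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nACGTA\n+\nIIIII\n"
)
dist = read_length_distribution(fastq)
for length in sorted(dist):
    print(length, dist[length])  # read length, number of reads
print(length_status(dist))
```

With untrimmed reads the counter would hold a single key, which corresponds to the single-bar graph; the trimmed read in the toy data adds a second key and triggers the warning.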

1.5.9  Sequence Duplication Levels

PCR may be used in the sequencing workflow, especially if the concentration of DNA is low,
and in RNA-Seq and ChIP-Seq for enrichment. PCR increases the number of DNA
fragments; a single fragment is duplicated several times (exact match). However, a
well-calibrated sequencing instrument will produce, in the end, a single read for each of the
library fragments. A low level of sequence duplication may indicate a high level of coverage.
In contrast, a high level of duplication indicates a bias due to PCR amplification. The graph of
sequence duplication levels plots the percentage of reads against the sequence duplication
level (number of duplicates). To save computer memory, only the first 200,000 reads in a
FASTQ file are checked for duplication. The number of duplicates is counted for each read.
A sharp rise in the graph may indicate the presence of a large number of reads with high
levels of duplication. A warning is displayed if duplicated reads make up more than 20% of
the total. A failure sign is shown if duplicated reads make up more than 50% of the total.
Figure 1.23a shows that the majority of reads are unique. However, duplicated reads
make up more than 20% of the total; therefore, a warning is issued. Figure 1.23b
shows that duplicated reads make up more than 50% of the total; therefore, the
metric fails.
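The duplication check and its thresholds can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the QC tool's actual algorithm: the function name `duplication_summary` and its parameters are hypothetical, a read is treated as duplicated when its exact sequence occurs more than once in the sample, and the sample is limited to the first 200,000 reads as described in the text.

```python
from collections import Counter
from itertools import islice


def duplication_summary(sequences, max_reads=200_000, warn=20.0, fail=50.0):
    """Summarise exact-match duplication among the first `max_reads` sequences.

    Returns (levels, pct_duplicated, status), where `levels` maps a duplication
    level (number of copies) to the percentage of reads at that level.
    """
    sample = list(islice(sequences, max_reads))  # limit memory use
    total = len(sample)
    if total == 0:
        return {}, 0.0, "pass"
    copies = Counter(sample)  # sequence -> number of exact copies
    reads_per_level = Counter()
    for n in copies.values():
        reads_per_level[n] += n  # n reads sit at duplication level n
    levels = {lvl: 100.0 * c / total for lvl, c in sorted(reads_per_level.items())}
    # Assumption: every read whose sequence occurs more than once is "duplicated".
    pct_duplicated = 100.0 - levels.get(1, 0.0)
    if pct_duplicated > fail:
        status = "fail"
    elif pct_duplicated > warn:
        status = "warn"
    else:
        status = "pass"
    return levels, pct_duplicated, status


# Toy data (assumed for illustration): one sequence present three times, seven unique.
reads = ["ACGT"] * 3 + ["TTGA", "GGCC", "CATG", "AACC", "GTGT", "TGCA", "CCGG"]
levels, pct_duplicated, status = duplication_summary(reads)
print(levels, pct_duplicated, status)
```

In the toy list, 3 of the 10 reads share a sequence, so 30% of the reads are duplicated and the status is a warning, which mirrors the situation described for Figure 1.23a: most reads are unique, yet the 20% threshold is exceeded.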

1.5.10  Overrepresented Sequences

Overrepresented sequences in genomic DNA data indicate a clear bias or contamination
due to adaptor dimers. However, in RNA-Seq, the overrepresented sequences can also be

FIGURE 1.22  Sequence length distribution graphs (equal length and variable lengths).